ABSTRACT

Tailored testing procedures for achievement testing
were applied in a situation that failed to meet some of the
specifications generally considered to be necessary for tailored
testing. Discrepancies from the appropriate conditions included the
use of small samples for calibrating items, and the use of an item
pool that was not designed to be homogeneous in content. The item
pool contained 180 items concerning educational measurement that were
calibrated separately by a one-parameter logistic model and by a
three-parameter logistic model. The 110 undergraduate students were
each tested at two sessions a week apart with both one-parameter and
three-parameter tailored tests at each session. All tests were
administered on a computer terminal. The results were studied for
several characteristics including: goodness of fit of the observed
responses to those predicted by each model; the information function
of each test compared to a fifty-item traditional paper-and-pencil
test; the reliabilities of the tailored tests; and the content
validities of the tailored tests. A unidimensional tailored test of
vocabulary was also administered, with satisfactory results. The
achievement tests generally produced unsatisfactory results,
presumably because of the discrepancies from appropriate conditions.

Tailored testing has frequently been proposed as an innovative solution
to many age-old measurement problems. In particular, tailored testing
procedures can theoretically alleviate many commonly encountered problems
with conventional, paper-and-pencil multiple choice tests. One problem
with conventional tests, in which all the examinees are administered the
same questions, is that test items are often of inappropriate difficulty
for many examinees. An examinee with low ability may be frustrated by
the difficult items on the test and, therefore, will resort to random
guessing or to item omissions. On the other hand, an examinee with a
high ability level will often find many test items to be too easy and
unchallenging. In general, there is a tendency for conventional tests
to be most appropriate and accurate for measuring the average examinee.
This tendency is reflected by the fact that the standard error of
measurement of a test is usually higher at the extremes than in the
middle of the ability range. The result of imprecise measurement, of
course, is lower overall test reliability.

Tailored testing procedures (Lord, 1970; Weiss, 1974) have been developed
to alleviate these and other problems with conventional tests, but we will
see that, in so doing, a whole new host of problems may be introduced.
The purpose of the present paper is to describe some of these difficulties
which became evident while conducting tailored testing research at the
University of Missouri-Columbia. First, however, it may be helpful to
briefly discuss the rationale behind tailored testing and some primary
characteristics.

One major distinguishing feature of tailored testing is its attempt
to administer test items of appropriate difficulty level to each examinee.
That is, rather than administering the same set of test items to all
examinees, the procedures attempt to "tailor make" the test for each
individual. This is accomplished by the selection of items for
administration that approximately match item difficulty parameters to an
examinee's estimated ability level after each response to an item,
resulting in efficient measurement that facilitates the control of test
errors. However, in order to implement tailored testing it is usually
necessary to utilize computer capabilities for several steps in the
procedure. Tailored testing itself is often based on latent trait or item
characteristic curve (ICC) theory (Lord, 1952; Lord and Novick, 1968)
which involves relatively sophisticated mathematical models. In addition,
the procedures require a precalibrated pool of items to be available for
selecting the test items to be administered. This is usually accomplished
by submitting item response data from some conventional test to one of
several existing latent trait calibration programs (Wright and
Panchapakesan, 1969; Wood, Wingersky, and Lord, 1976; and Urry, 1975) in
order to obtain item parameter estimates such as difficulty,
discrimination, and guessing indexes.

Another required step is the development of a computer program to
operate the tailored testing procedure in an actual test setting on an
interactive basis with the examinee. In developing this program, many
decisions must be made as to the operational characteristics of the test
itself: (a) the entry point into the item pool (the first item
administered), (b) the ability estimation procedure to be utilized
(usually either a Bayesian or maximum likelihood technique), (c) the
method used to select successive items, given responses on the previous
items, and (d) a stopping rule to terminate the test.

As might be expected, numerous problems may arise that must be dealt
with in order to establish tailored testing as a viable alternative to
conventional testing. In particular, the item calibration and ability
estimation phases of tailored testing present special difficulties. These
will be considered in greater detail later in this paper, but it will suffice
for now to note that, first, sample size is an important determinant of
item calibration quality (Reckase, 1977). Moreover, calibration weaknesses
may be compounded when data from several small sample calibrations are
linked together using items in common to form a larger item pool. Another
problem that may occur under certain circumstances is the nonconvergence
of ability estimation procedures. Finally, some of the assumptions of the
latent trait models may be violated in tailored testing procedures, resulting
in problems when, for example, an extension is made from ability testing
to applications in achievement testing.


Latent Trait Models 

The Rasch (1960), or one-parameter logistic (1PL) model, has been
thoroughly described by Wright (1977). In general, the 1PL model requires
only one ability parameter, $\theta_j$, for each person and one item
difficulty parameter, $b_i$, for each item in order to represent the
interaction between an examinee and a test item. The exponential form of
the 1PL model is

\[ P(u_{ij}) = \frac{\exp[u_{ij}(\theta_j - b_i)]}{1 + \exp(\theta_j - b_i)} \tag{1} \]

where $u_{ij}$ is the score (0 or 1) on Item i by Person j, $\theta_j$ and
$b_i$ are as defined above, and $P(u_{ij})$ is the probability that
$u_{ij}$ is equal to 0 or 1.



In contrast, the three-parameter logistic (3PL) model presented by
Birnbaum (1968) requires the estimation of three item parameters to represent
the interaction between test items and examinees. The model is given by

\[ P(u_{ij} = 1) = c_i + (1 - c_i) \frac{\exp[D a_i(\theta_j - b_i)]}{1 + \exp[D a_i(\theta_j - b_i)]} \tag{2} \]

where $P(u_{ij} = 1)$ is the probability of a correct response by Person j
to Item i; $c_i$ is the guessing parameter for Item i; D is a scaling
constant equal to 1.7; $a_i$ is the item discrimination parameter; $b_i$
is the item difficulty parameter; and $\theta_j$ is the ability parameter
for Person j. The probab-



Both models have in common the assumptions that the items are scored
dichotomously, that the latent trait being measured by the items is
unidimensional, that the model describes the interaction between a person
and an item, and that local independence holds (Lord and Novick, 1968).
This last assumption simply means that, for examinees at a given ability
level, the probability of a certain response to any given item on a test
is unaffected by any previous response.
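
As a rough illustration of the two response models, the following Python
sketch evaluates the probability of a correct response under each. The
function names are hypothetical, and the 3PL function assumes the standard
Birnbaum form with the scaling constant D = 1.7 applied to the
discrimination term; it is a sketch, not the code used in the study.

```python
import math

D = 1.7  # scaling constant named in the 3PL description


def p_1pl(theta, b):
    """Probability of a correct response (u = 1) under the 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: guessing parameter.
    """
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic
```

When ability equals item difficulty, the 1PL probability is .50, while the
3PL probability lies halfway between the guessing parameter and 1.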

The unidimensionality assumption has particular relevance when
considering tailored testing applications to ability tests compared to
achievement tests. In the former case, factor analytic procedures usually
yield one dominant factor being measured by the test items. Certainly this
is the case for ability measures such as verbal or quantitative aptitude,
and often is the case for intelligence tests.

On the other hand, achievement tests are usually constructed with
multidimensional measurement as a primary goal. Since most achievement
tests are based on the objective of sampling distinct content areas or
domains, multidimensionality inevitably seems to be built into the tests.
With this being the case, the unidimensional assumption of latent trait
measurement needs to be examined for achievement test applications of
tailored testing. The present study brings evidence to bear on this issue
and will be discussed in detail later. However, it is convenient as a
basis for comparison to first summarize the results of a previous study
reported on tailored testing applied to unidimensional vocabulary ability
measurement (Koch and Reckase, 1978).



Vocabulary Tailored Testing Study

The purpose of the study was to compare the 1PL and 3PL models in
a tailored testing application to vocabulary ability measurement. A
counterbalanced test-retest design was employed in which there were two
separate test sessions one week apart for each examinee, with both the 1PL
and 3PL tests administered at each session. The calibration programs used
to obtain item parameter estimates for the 72 item vocabulary pool were the
Wright and Panchapakesan (1969) program for the 1PL model and the LOGIST
program (Wood, Wingersky, and Lord, 1976) for the 3PL model. Test items
were selected for administration based on the information function
(Birnbaum, 1968), and maximum likelihood ability estimation was used.

In general the results demonstrated that tailored tests based on either
of these two latent trait models could be successfully applied to vocabulary
ability measurement. However, there were several specific areas where
one tailored test performed better than the other. For example, the 3PL
test was found not only to have more total test information than the 1PL
test, but also to have a better fit between the empirically obtained
responses and those predicted by the model than the 1PL model.

In regard to reliability, the 3PL procedure resulted in a significantly
higher reliability coefficient than the 1PL test. The values, which reflected
a combination of test-retest and equivalent forms reliability, were r =
.77 and r = .61, respectively. However, it cannot be too highly emphasized
that the 3PL procedure, in conjunction with maximum likelihood ability
estimation, failed to converge at ability estimates in nearly one-third
of the tailored tests. With these nonconvergence cases included in the
reliability calculation, the correlation coefficient for the 3PL tests
dropped to r = .36. With maximum likelihood scoring being a major technique
for ability estimation, the nonconvergence phenomenon constituted a serious
problem. The hypothesis was forwarded that the nonconvergence was due to
the item pool being too difficult overall for numerous examinees. It is
important to note that nonconvergence of ability estimation does not occur
in conjunction with the 1PL model.

Tailored Achievement Testing 

One interesting application of ICC theory was reported by Brown and
Weiss (1977) in which a tailored testing procedure was used for an
achievement test with multiple content areas. This research nicely
demonstrated that an adaptive testing strategy utilizing inter-subtest
branching substantially reduced the total test length while, at the same
time, providing equal precision of measurement compared with the
conventional achievement test battery. However, this application to
multidimensional achievement measurement did not address the issue of the
robustness of ICC theory with respect to the violation of the
unidimensionality assumption. This was due to the fact that each subtest
or content area was calibrated separately, rather than having one
calibration of a multidimensional item pool. Nor was there any attempt to
investigate another crucial aspect of achievement testing, namely content
validity. The current study provided an opportunity to examine both the
robustness of the ICC model and the content validity of tailored
achievement testing.

METHOD

Item Pool Construction 

Calibration. The items calibrated for use in the study were obtained
from a series of classroom achievement tests which were administered as
part of an undergraduate course in educational measurement. Response
data were collected from a total of 11 separate 50 item multiple choice
exams, most having 4 alternatives per item, covering the content area of
educational evaluation techniques. All of the tests were calibrated with
both the Wright and Panchapakesan (1969) program and the LOGIST program
(Wood, Wingersky, and Lord, 1976) which yielded the 1PL and 3PL item
parameter estimates, respectively. The sample sizes ranged from 96
examinees to 314 examinees, although most of the tests had sample sizes of
about 200.

The classroom tests themselves had been produced according to
traditional achievement test construction principles. Items were included
on the exams if they had moderate to high point biserial discrimination
indexes, and in such a manner that the average test difficulties were close
to .75. Being achievement tests, a table of specifications was used to
construct the tests to match course objectives. KR-20 reliabilities for the
exams were consistently found to be in the range from -^^eo to ^•SS.

Linking. Since all of the achievement tests had numerous items in
common across tests, item calibration linkings were performed in order
to form a large item pool for tailored testing. In this procedure the
goal is to link all the separate item parameter calibrations into one
final set of item parameters such that parameter estimates obtained from
different samples are put onto a single scale. Of course it would be
more convenient to have a single large sample of examinees (say 1,000
or more) to which a single test of 150 or more items could be administered.
In this latter situation, the need for item parameter linking would be
eliminated, and more stable item parameter estimates would be obtained
as well.

Unfortunately, in the typical classroom situation it is rare to have
more than 100 examinees taking a single test at one point in time. Moreover,
for test security reasons, it is usually necessary to construct a new
form of the exam for each new class, although numerous items may overlap.
Thus we are confronted with a situation in which many different small sample
size calibrations are required to obtain item parameter estimates. One
resulting problem is that the parameter scales for each separate
calibration are indeterminate. But it is important to note that the
parameter estimates are equivalent within a linear transformation. This
means that the very desirable attribute of latent trait or ICC models
referred to as invariance of item parameters (Lord and Novick, 1968) is
still maintained.

For space reasons, the present paper will only briefly describe the
procedures used to link the separate item calibrations together into one
large pool of 180 items for tailored testing. (Reckase, 1979, provides
a thorough discussion of item linking techniques.) In the current study
one of the tests was arbitrarily designated as the calibration base for
linking. Then the 3PL item discrimination estimates and the 1PL item
difficulty estimates were linked from all the separate test calibrations
onto the same scales as their corresponding item parameters in the
calibration base. The linear transformation incorporated the use of
multiplicative constants in the case of the 3PL linking and additive
constants in the case of the 1PL linking. The 3PL item difficulty
parameters were linked by means of simple linear regression. The 3PL item
guessing parameters required no transformation since they were already on
the same 0 to 1 scale, but they were combined using a weighted average
procedure.
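
The four linking operations named above can be sketched in Python as
follows. The paper does not say how the constants were computed, so this
sketch assumes they come from the items shared with the calibration base:
the additive constant as a difference of mean difficulties, the
multiplicative constant as a ratio of mean discriminations, and the
weights as calibration sample sizes. All function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def additive_constant(base_b, new_b):
    """1PL difficulties: shift that moves the new calibration onto the
    base scale (assumed: difference of common-item means)."""
    return mean(base_b) - mean(new_b)


def multiplicative_constant(base_a, new_a):
    """3PL discriminations: rescaling factor onto the base scale
    (assumed: ratio of common-item means)."""
    return mean(base_a) / mean(new_a)


def simple_regression(x, y):
    """Least-squares slope and intercept for predicting y from x,
    usable for mapping new 3PL difficulties (x) onto base values (y)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def weighted_average(values, weights):
    """Combine repeated guessing estimates for the same item
    (assumed weights: calibration sample sizes)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

For instance, an item whose guessing parameter was estimated as .20 in a
sample of 100 and .30 in a sample of 300 would receive a combined value
of .275 under this weighting.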

Table 1 presents the means, standard deviations, and ranges of the
item parameter estimates resulting from the calibration and linking
procedures described above. Both the 1PL and 3PL item pools contained
exactly the same items. The correlation between their respective item
difficulty parameters was .91. The distributions of the difficulty
parameters were markedly peaked rather than taking on a uniform
distribution which would have been preferred, based on previous research.

Insert Table 1 about here



Tailored Testing Procedures

The three required components of the tailored testing procedure
included (a) an item selection routine, (b) an ability estimation technique,
and (c) a stopping rule to terminate the test. These components have been
described elsewhere (Koch and Reckase, 1978; Patience, 1977), but they
will be summarized here.

For both the 1PL and the 3PL procedures, items were selected for
administration which maximized the value of the information function
(Birnbaum, 1968). The information function described the potential
contribution of each item to the estimation of a given examinee's ability
level. Item information for the 1PL procedure was computed as

\[ I(\theta_j, u_{ij}) = \frac{\exp[-(\theta_j - b_i)]}{\{1 + \exp[-(\theta_j - b_i)]\}^2} = \psi(\theta_j - b_i) \tag{3} \]

where $I(\theta_j, u_{ij})$ is the information of Item i at ability level
$\theta$ for Person j, given item response $u_{ij}$, with $\theta_j$ and
$b_i$ having the same meanings as given in formula 1, and $\psi(x)$ is the
logistic probability density function.

For the 3PL procedure, item information was calculated as

\[ I(\theta_j, u_{ij}) = D^2 a_i^2 \psi[D L_i(\theta_j)] - D^2 a_i^2 P_{ij}(\theta_j) \psi[D L_i(\theta_j) - \log c_i] \tag{4} \]

where $I(\theta_j, u_{ij})$ is the information as defined above;
$L_i(\theta_j) = a_i(\theta_j - b_i)$; $P_{ij}(\theta_j)$ is the
probability of a correct response to Item i given ability level
$\theta_j$; $\psi(x)$ is the logistic probability density function; and
the other parameters have their definitions given previously. The total
test information was then simply the sum of the item information
(Birnbaum, 1968) given by:

\[ I(\theta) = \sum_{i=1}^{n} I(\theta_j, u_{ij}) \tag{5} \]

In the tailored testing procedure, the examinee's initial ability
estimate was randomly assigned to be either +.50 or -.50. The first item
to be administered was selected such that the information function was
maximal for the initial ability estimate. If the examinee answered the
first item correctly, the new ability estimate was placed at a fixed
stepsize (.693) away in a positive direction (i.e., a more difficult item).
An incorrect response resulted in an ability estimate that was -.693 away.
A fixed stepsize was only used until a maximum likelihood ability estimate
could be obtained. In both cases, the item administered was the one with
maximum information for the given ability estimate. When at least one
correct and one incorrect response were obtained, the ability level of
the examinee was estimated using an empirical maximum likelihood procedure,
with the mode of the likelihood function becoming the new ability estimate.
The next item administered was the one in the item pool with maximum
information for that ability estimate, with the restriction that no item
could be administered more than once during the test.

The tailored tests for both the 1PL and the 3PL procedures cycled
through this process until one of two stopping rules was reached: either
no item remained in the item pool with an information value greater than
a specified amount, or a maximum of 20 items had been administered.
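
The cycle just described can be sketched for the 1PL case as follows.
This is a minimal illustration, not the program used in the study: the
information cutoff `MIN_INFO` and the search grid are hypothetical values
(the paper does not report them), the likelihood mode is located by a
simple grid search, and `answer` stands in for the examinee.

```python
import math

STEP = 0.693      # fixed stepsize given in the text
MAX_ITEMS = 20    # maximum test length given in the text
MIN_INFO = 0.05   # hypothetical information cutoff
GRID = [g / 100.0 for g in range(-400, 401)]  # hypothetical ability grid


def prob(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def info(theta, b):
    p = prob(theta, b)
    return p * (1.0 - p)  # equals the logistic density at theta - b


def ml_estimate(responses):
    """Mode of the 1PL likelihood over the grid."""
    def loglik(theta):
        return sum(math.log(prob(theta, b) if u else 1.0 - prob(theta, b))
                   for b, u in responses)
    return max(GRID, key=loglik)


def tailored_test(pool, answer, start):
    """pool: item difficulties; answer(b) returns 0 or 1;
    start: initial ability estimate (+.50 or -.50)."""
    theta, remaining, responses = start, list(pool), []
    while remaining and len(responses) < MAX_ITEMS:
        b = max(remaining, key=lambda d: info(theta, d))
        if info(theta, b) < MIN_INFO:
            break  # no sufficiently informative item remains
        remaining.remove(b)  # no item is administered twice
        u = answer(b)
        responses.append((b, u))
        if {r for _, r in responses} == {0, 1}:
            theta = ml_estimate(responses)  # mixed responses: use ML
        else:
            theta += STEP if u == 1 else -STEP  # fixed step until then
    return theta, responses
```

Note that an examinee who answers every item correctly never reaches the
maximum likelihood branch, so the estimate keeps climbing by the fixed
stepsize until the pool has nothing informative left to offer.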

Design

The study employed a counterbalanced design in which there were
two separate test sessions one week apart for each examinee, with both the
1PL and the 3PL tests administered at each session. The counterbalancing
resulted from the reversal of the presentation order of the test models
used from one test session to the next. The test-retest feature of the
design was planned to facilitate reliability comparisons between the two
tailored testing procedures. The tests were arranged so that the examinee
could not perceive receiving two tests during each session. The
administration of the tests was accomplished on Applied Digital Data
Systems (ADDS) Consul 980 cathode ray tube terminals which were connected
to an IBM 370/168 through a timesharing system.

Sample

The subjects participating in the study were junior and senior
undergraduate students enrolled in an introductory course in measurement
and evaluation. Shortly after the students had taken their first course
exam, they were asked to volunteer to take other tests over the same
material, but in shortened form on a computer terminal. In order to provide
some motivation, the instructor informed each student that the tailored
tests would be used to assign a course grade if his or her performance was
better than the score on the conventional course exam. A total of 110
students took part.

Analyses 

The primary research issues in the achievement test study included
comparisons of (a) the respective test-retest reliability coefficients
for the 1PL and 3PL tailored testing procedures, (b) the goodness of fit
of the two models using mean squared deviations of observed from predicted
response data, and (c) the total test information functions for the two
tailored testing methods. Also of interest were comparisons of the ability
estimates yielded by the two procedures, the content validity of the
tailored tests, and the correlation of the ability estimates with the
conventional course exam.

The reliability comparison was based on correlations between the
ability estimates yielded by the 1PL and 3PL procedures in the two test
sessions. These coefficients were not strictly test-retest reliabilities
since no examinee could possibly receive exactly the same tailored test
twice, due to different starting points in the item pool and different
paths through the pool. Therefore, the reliability coefficients reflected
a mix between test-retest and equivalent forms reliability. The respective
reliabilities for the two procedures were compared statistically using a
t-test based on Fisher's r to z transformation.

The measure used to determine the goodness of fit of the observed
data to the models was the mean squared deviation (MSD) statistic, which
was calculated by summing the squared differences for each person between
the actual response to an item and the probability of a correct response
predicted by the model. These squared differences were computed using
the formula

\[ MSD_j = \frac{\sum_{i=1}^{n} (u_{ij} - P_{ij})^2}{n} \tag{6} \]

where $MSD_j$ was the mean squared deviation for Person j, $u_{ij}$ was
the actual response to Item i by Person j, $P_{ij}$ was the probability of
a correct response to Item i by Person j, and n was the number of items in
the tailored test for Person j. A systematic sample of 29 examinees was
analyzed to compare the 1PL and 3PL tests using the MSD statistic as the
dependent variable in a t-test. The sampling was systematic rather than
random to ensure that the fit comparison covered the whole range of
ability estimates.
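
A minimal Python sketch of the MSD statistic for one examinee is given
below. The function name and the example values are hypothetical; the
predicted probabilities would come from whichever model is being
evaluated at the examinee's ability estimate.

```python
def msd(responses, probabilities):
    """Mean squared deviation between observed item scores (0 or 1)
    and model-predicted probabilities of a correct response."""
    n = len(responses)
    return sum((u - p) ** 2 for u, p in zip(responses, probabilities)) / n


# Hypothetical three-item tailored test for one examinee.
example = msd([1, 0, 1], [0.8, 0.4, 0.6])
```

Smaller values indicate closer agreement between the model and the
observed responses; a model predicting .50 for every item yields an MSD
of exactly .25 regardless of the responses.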

The total test information analyses were performed to compare the
1PL and 3PL procedures in terms of relative efficiency (Birnbaum, 1968).
The relative efficiency was the ratio of information provided by each
procedure's tailored test to the information provided by the traditional
50 item paper-and-pencil course exam. Again, the plot constructed for
the relative efficiency comparison was based on a selected sample of cases
across the whole range of tailored testing ability estimates.

The content validity analyses were conducted to determine the degree
to which both the item pools and the tailored tests accurately represented
the measurement of the course objectives that had been specified. Since
a table of specifications was used to construct the traditional course
exam, a particular weighting of test items to content areas was assured.
The issue was whether or not the item pools and tailored tests reflected
the same weightings. A set of Chi Square analyses were performed to
determine the fit between desired and observed content distributions in
this respect.
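
The content-fit comparison can be illustrated with a goodness-of-fit Chi
Square computed in Python. This is a sketch under assumptions: the paper
does not give its exact computation, the function name is hypothetical,
and the counts in the usage lines are invented for illustration only.

```python
def chi_square_fit(observed_counts, desired_proportions):
    """Chi Square statistic and degrees of freedom for observed item
    counts per content area against the desired content weighting."""
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, desired_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(observed_counts) - 1


# Hypothetical example: 20 items over two equally weighted content areas.
statistic, df = chi_square_fit([15, 5], [0.5, 0.5])
```

The statistic would then be referred to a Chi Square distribution with the
returned degrees of freedom; a large value signals that the items actually
administered depart from the table of specifications.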

Other analyses included descriptive statistics for the two types of
tailored tests, including average test length, average test difficulty,
number of items actually used from the item pools, etc. In addition,
several correlations were computed, such as ability estimate
intercorrelations across models and correlations of tailored test scores
with regular exam scores. Finally, a principal components analysis of the
traditional course exam was run to determine its structure. The purpose
was to determine if the achievement test was truly multidimensional.

Goodness of Fit

In Table 2 are presented the results for the MSD statistic used in
the goodness of fit comparison of the 1PL and 3PL models. The computed
MSD values for 29 cases for each model are shown, along with the means,
standard deviations, and the results of a dependent t-test analysis of
the data. The results indicated that the MSD statistic was significantly
smaller for the 3PL tailored testing procedure (p < .01), reflecting better
fit of the 3PL model to the observed responses.

Information Function Analyses 

The relative efficiency comparison of the total test information for
the 1PL and 3PL procedures is shown in Figure 1.

Insert Table 2 about here

The horizontal broken line indicates the information of the traditional
50 item course achievement test as the standard for comparing these two
types of tailored tests. However, the ability scale used for plotting
the 1PL relative efficiency curve is not the same as that for the 3PL
relative efficiency curve. Even so, a subjective visual comparison of
the two is possible.

Insert Figure 1 about here

In general, the plots indicate that neither tailored test procedure
was as informative as the conventional course exam. However, the relative
information of the 3PL procedure came substantially closer to the
traditional paper-and-pencil exam than did the 1PL tailored tests. This
finding was in contrast to the vocabulary tailored testing study results
(Koch and Reckase, 1978) which showed the 3PL procedure to have more
information than the conventional test, while the 1PL procedure had almost
as much information as the conventional test. The overall shape of the
information relative efficiency curve was somewhat irregular for the 1PL
tests, but it was peaked for the 3PL tests. Also, the 1PL procedure had its
highest relative efficiency at the upper extremes of ability where very
few examinees were classified, while the 3PL tests were most informative
precisely in the ability range that encompassed most of the examinees.

Reliability 

The correlation matrix in Table 3 reports the coefficients obtained
from intercorrelating the ability estimates yielded by the two models
in the tailored testing study. The .44 correlation between the ability
estimates from the first 1PL test (1PL 1) and the second 1PL test (1PL
2) was the reliability coefficient for that procedure. This value, although
by no means high, was significantly greater (p < .01) than the .00
reliability coefficient obtained from the 3PL tailored testing procedure
(3PL 1 vs. 3PL 2). Neither tailored testing procedure attained a
reliability that approached the traditional 50 item paper-and-pencil form
of the test (r = .74). Although both tailored testing reliabilities were
disturbingly low, the 3PL .00 reliability was of particular concern. One
factor which impacted on the reliability of the 3PL procedure was the
occurrence of nonconvergence of the maximum likelihood ability estimation
for 9 out of the 110 cases. Nonconvergence is a frequently encountered
problem when using maximum likelihood ability estimation in conjunction
with the 3PL model. (Recall that nonconvergence occurred in almost
one-third of the vocabulary tailored tests previously mentioned.)

The deletion of these 9 cases from the reliability correlation analyses
resulted in the coefficients shown in parentheses in Table 3. The 1PL
reliability increased slightly from .44 to .46 and the 3PL reliability
went from .00 to .12. When these reliabilities were adjusted with the
Spearman-Brown formula to approximate the length of the 50 item
paper-and-pencil test, the 1PL coefficient went up to .68, while the 3PL
coefficient also increased, both still being lower than the reliability
of the traditional test. (Lord (1977) has questioned the use of
Spearman-Brown corrections for tailored test reliabilities.)
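
The Spearman-Brown adjustment can be reproduced with a few lines of
Python. The lengthening factor of 2.5 is an assumption: it treats the
tailored tests as 20 items long (the maximum length) projected to the 50
item exam. The function name is hypothetical.

```python
def spearman_brown(r, k):
    """Projected reliability when a test with reliability r is
    lengthened by a factor of k."""
    return k * r / (1.0 + (k - 1.0) * r)


# 1PL reliability of .46 projected from 20 items to 50 items.
projected_1pl = spearman_brown(0.46, 50 / 20)
```

Under this assumption the projected 1PL value rounds to .68, in agreement
with the coefficient reported above.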



Insert Table 3 about here 

To search further for sources of the low 3PL reliability, ability
estimates were examined to locate individual examinees with widely
differing 3PL ability scores from one test session to the next. Ten such
cases were identified and studied in detail. A definite pattern emerged
which reflected problems in the operating procedures of the tailored tests.
All 10 cases were situations in which one of the tailored tests was only
3 or 4 items long, while the other was 20 items in length. The short
test resulted when the examinee answered the initial and all the subsequent
items correctly. Since there was never both a correct and incorrect
response, no maximum likelihood ability estimate could be computed. Thus
each successive item administered was more difficult by a fixed stepsize
of about .69 on the ability scale. Ordinarily this would not be a problem
with a good quality item pool. However, the achievement test item pool had
only 2 out of 180 items above the zero point on the item difficulty scale.
Moreover, the entry point into the pool had been set at +.50 or -.50.
The result was that it was possible for an examinee to happen to answer
the first 3 or 4 tailored test items correctly and "top out" the item
pool. When these cases of unreliable 3PL ability estimation were thrown
out, the 3PL test reliability went up to .43. Obviously this was achieved
only through substantial "massaging" of the data. It should be noted that
the skewness of the item difficulties resulted mainly from the item
linking procedures discussed earlier.

Another problem with the 3PL tailored tests was that the item pool
was functionally limited to only about iu out of the 180 items. Since
items were selected for administration based on the information function,
only those items with relatively high item discrimination values were
administered. The effect of this artificial restriction in the 3PL item
pool was an overlap of more than fiO% between the items administered from
the first test session to the next. However, item repetition over tests
was minimal for the 1PL tests. It seemed likely that common items across
tests would favorably affect the 3PL reliability. However, partial
correlation analyses indicated that the proportion of items in common had
a negligible effect on the test reliability in previous research.



In Table 4 are listed the correlations computed between the tailored
test ability estimates and scores on the paper-and-pencil course exams.
In general, the correlations were relatively low. This was also true
for the tailored test correlations with Exam 1, even though the tests
covered the same content areas.



Table 5 presents some descriptive statistics for both test sessions 
of the two types of tailored tests, since the administration of a maxi- 
mum of 20 items was one stopping rule for the tests, the values for the 
mean number of items administered indicate that roost of the tests went the 
full distance. Tliis result implied that ample n\xmbers of items were avail- 
able in the item pool which had sufficient information for roost of the 
examinees. The mean test difficulty values reflected the overall low 
difficulty of the items for the majority of the students, since the mean 
proprotion of items correct would have been expected to be .50 if the ilems 
were of exactly appropriate difficulty, assuming no guessing. The standard 
deviations of the ability estimated revealed that the scores yielded by 
the 3PL tailored tests had a restricted range compared to tlie IPL tests, 
, at least when the 10 unreliable cases were removed from the analyses. 



content Validity 



As can be seen in Table 6, both the IPL and 3?L iter pools used for 
the tailored tests accurately reflected the weighting of the content areas 
in the paper~and-pencil course exam, of course both item pools had identi- 
cal content area breakdowns since the two pools contained the same items. 
A Chi Square analysis indicated no lack of fit for the number of items in 
each content area of the pools compared to the corresponding number of items 
on Che course exam. However, the nvunber of items administered by content 
area for a systematic sample of 29 tailored tests showed significant lack 
of fit to both tha item pools and the course exam. The fit of the 3PL 
tailored tests in terms of content validity was parti c^ularly bad, while 
the IPL tests cane fairly close to matching the content area weightings 
of the item pools and the course exam, it should be noted that no conscious 
atterapt was made in the tailored testing operating program to require 
branching aaong the content areas. The object was just to see if selecting 
items for administration on the basis of information would approxiitate 
tlie content area weightings of the item pools and the course exam. 



Other Correlation Analyses 



Insert Table 4 about here 



Dtsscriptive Statistics 



Insert Table 5 about here 



Insert Table 6 about here 
Goodness of Fit

The superior fit of the observed responses to those predicted by
the 3PL model was expected based on previous research (Koch and Reckase,
1978; Reckase, 1977). It was not surprising that a model with three item
parameters was able to fit observed response data better than a model
with only one item parameter. Since the MSD values reflected an average
fit across the response string for an examinee, the implication can be
made that the 3PL tailored tests demonstrated better "person fit" than
the 1PL tests.
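
As a rough illustration of a per-examinee fit index of this kind, the
sketch below assumes that the MSD statistic is the mean squared deviation
between an examinee's observed 0/1 responses and the model-predicted
probabilities of a correct response at the examinee's ability estimate.
That reading, the function name, and the example values are assumptions
made for illustration.

```python
def msd(responses, probabilities):
    """Mean squared deviation between observed 0/1 responses and the
    predicted probabilities of a correct response (assumed definition)."""
    if len(responses) != len(probabilities):
        raise ValueError("responses and probabilities must have equal length")
    squared = [(u - p) ** 2 for u, p in zip(responses, probabilities)]
    return sum(squared) / len(squared)


# Hypothetical response string and model predictions for one examinee
example_msd = msd([1, 0, 1, 1], [0.8, 0.3, 0.6, 0.9])
print(round(example_msd, 4))
```

Smaller values indicate that the model's predictions were closer to what
the examinee actually did across the response string.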

Information Function Analyses

The results of the relative efficiency comparisons shown in Figure
1 clearly demonstrated the inadequacy of both the 1PL and the 3PL tailored
achievement tests compared to the traditional paper-and-pencil achievement
test. This result was contrary to the findings of previous tailored
testing research with vocabulary ability tests. In the latter case, 3PL
tailored tests averaging 19 items were more than twice as informative as
the 30 item conventional vocabulary test at certain points on the ability
scale. Since the achievement tailored tests averaged only about 20 items
in length compared to the 50 item course exam, a drop was expected in the
tailored test relative efficiency. This was predicted since total test
information is just the sum of the item information. However, it was not
expected that the 1PL tailored tests would be only about half as
informative and the 3PL tailored tests only about 80% as informative as the
conventional course exam. No conclusive explanation could be identified for
this result. Perhaps the item parameter linking procedures were at fault.

Certainly it was true that the tailored tests had more information
on a per-item basis. However, that is beside the point. Part of the
merit of tailored tests is that a shortened test may be as informative
about an examinee's ability as the conventional full length test, which
is accomplished through more accurate measurement by the administration
of only the appropriate test items. Clearly, further research is required.
A final curious result was that the 3PL tailored tests were more
informative than the 1PL tests in the ability range where most of the
examinees were concentrated, even though the 1PL tailored tests were
significantly more reliable.
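
The relation between total information, relative efficiency, and per-item
information can be written out explicitly. The notation below is an
assumption introduced for illustration: T denotes a tailored test of n_T
items and C the conventional course exam of n_C items.

```latex
% Test information is the sum of the item information functions
I_T(\theta) = \sum_{i=1}^{n_T} I_i(\theta)

% Relative efficiency of the tailored test with respect to the course exam
RE(\theta) = \frac{I_T(\theta)}{I_C(\theta)}

% Ratio of mean information per item
\frac{I_T(\theta)/n_T}{I_C(\theta)/n_C} = RE(\theta)\,\frac{n_C}{n_T}
```

With roughly n_T = 20 and n_C = 50, a relative efficiency of about .80
corresponds to a per-item ratio of about .80 x 50/20 = 2.0, and a relative
efficiency of about .50 to a per-item ratio of about 1.25. This is the
sense in which the tailored tests were more informative per item while
still falling short of the course exam in total information.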

Reliability

The reliability results provided another setback for the tailored
testing procedures. As has been mentioned earlier, the previous vocabulary
tailored testing study yielded adequately high reliabilities for
both the 1PL and the 3PL procedures, the values being r = .61 and r =
.77, respectively. But the tailored achievement test reliabilities did
not even approach the course exam reliability. Moreover, the 3PL procedure
had zero reliability, for which several contributing factors were
identified.
One major problem was that the item parameter linking resulted in
a somewhat skewed and shifted distribution of the 3PL difficulty
parameters, so that only about 30 out of 180 items were above the zero
point on the scale. This outcome, in combination with the tailored test
operational procedures of the +.50 entry point and the fixed stepsize,
resulted in unreliable tests for numerous examinees. In hindsight, the
entry point into the item pool should have been shifted downward on the
ability scale so that approximately an equal number of items were above
and below the starting point. In that situation, examinees who were able
to answer the first few items correctly would not have been able to "top
out" the item pool.
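
One simple way to place the entry point so that about equal numbers of
items lie above and below it is to start at the median item difficulty.
The sketch below is an illustration under that assumption; the difficulty
values are hypothetical and are not taken from the study's item pool.

```python
import statistics


def balanced_entry_point(difficulties):
    """Entry point with about as many items above as below: the median."""
    return statistics.median(difficulties)


def count_above(difficulties, point):
    """Number of items strictly harder than the given scale point."""
    return sum(1 for b in difficulties if b > point)


# Hypothetical, negatively shifted set of item difficulties
difficulties = [-4.0, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.2, 1.0]
entry = balanced_entry_point(difficulties)
print(entry, count_above(difficulties, 0.5), count_above(difficulties, entry))
```

With a pool shifted toward easy items, a fixed positive entry point leaves
very few harder items to step up to, while the median leaves about half of
the pool available in each direction.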

Nonconvergence of maximum likelihood ability estimation was another
problem with the 3PL tailored tests. When the very large number of
nonconvergence cases was observed in the previous vocabulary study, the
hypothesis was forwarded that excessively difficult items were the cause,
where long strings of incorrect responses were obtained. In such a case
no reasonable maximum likelihood ability estimate could be calculated since
the likelihood function approached a uniform distribution with the mode
at the guessing level. Since the achievement tailored tests were based
on the examinees' regular course material, over which they had been
previously tested, the nonconvergence problem was reduced somewhat, with
only 9 out of 110 failures to converge. Several approaches are currently
being studied to resolve the nonconvergence problem, including the
alternative of substituting Bayesian ability estimation in place of
maximum likelihood.
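
The contrast between maximum likelihood and Bayesian ability estimation
can be sketched for the simplest degenerate case, a response string with
no incorrect answers. The example is illustrative only: the item
parameters, the scaling constant D = 1.7, the standard normal prior, and
the grid-based posterior mean are all assumptions, not the procedures used
in the study.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed here)


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response string at ability theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def posterior_mean(items, responses, low=-4.0, high=4.0, points=161):
    """Bayesian posterior mean of ability with a standard normal prior,
    approximated on an evenly spaced grid."""
    grid = [low + i * (high - low) / (points - 1) for i in range(points)]
    weights = [math.exp(log_likelihood(t, items, responses) - 0.5 * t * t)
               for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


# Five hypothetical items, all answered correctly
items = [(1.0, 0.0, 0.2)] * 5
all_correct = [1, 1, 1, 1, 1]
curve = [log_likelihood(t, items, all_correct) for t in (0.0, 1.0, 2.0, 3.0)]
estimate = posterior_mean(items, all_correct)
print(curve, estimate)
```

The log-likelihood keeps rising as ability increases, so an iterative
maximum likelihood search has no finite maximum to converge on, whereas
the prior pulls the Bayesian estimate back to a finite value.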

Since neither of the problems discussed immediately above applied
to the 1PL tailored tests, another explanation must be found for the low
reliability of that procedure. The most obvious candidate is the
multidimensionality of the test. Since the principal components analysis of
the regular course exam indicated the presence of 20 factors with
eigenvalues greater than one, it was obvious that the unidimensionality
assumption of the latent trait models had been violated. Therefore, the low
1PL reliability could have simply been a result of the violation of that
assumption. Of course, the same argument would apply to the 3PL tailored
tests. If indeed future research shows that the latent trait models are
not robust with respect to the violation of the unidimensionality
assumption, then each content area of achievement tests will have to be
identified and calibrated separately. In addition, intricate branching
schemes will have to be devised so that the tailored tests can provide
ability estimates for each content area. Scoring would then become a
problem in terms of weighting the content areas. If the content areas were
correlated somewhat, it might be possible to use regression methods to
predict the appropriate entry point into a new content area, given an
ability estimate on the previous content area (Brown and Weiss, 1977).
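
The eigenvalue-greater-than-one count referred to above can be sketched as
follows. The example assumes NumPy is available and uses a small
hypothetical correlation matrix with two uncorrelated clusters of items; it
is not the course exam data.

```python
import numpy as np


def count_large_eigenvalues(corr, threshold=1.0):
    """Number of eigenvalues of a correlation matrix above the threshold."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))
    return int(np.sum(eigenvalues > threshold))


# Hypothetical 6-item correlation matrix: two clusters of three items,
# correlated .6 within a cluster and 0 between clusters
block = np.full((3, 3), 0.6)
np.fill_diagonal(block, 1.0)
zeros = np.zeros((3, 3))
corr = np.block([[block, zeros], [zeros, block]])

n_factors = count_large_eigenvalues(corr)
print(n_factors)
```

A strictly unidimensional item set would be expected to show one dominant
eigenvalue; a count as large as the 20 reported for the 50 item course exam
points to many distinct sources of variance.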

Content Validity

The content validity results demonstrated that, even though the item
pools may reflect proportionate content area weightings to a conventional
test, the tailored tests using the item pools should not necessarily be
expected to reflect the same weightings. For the 1PL procedure this result
was somewhat of a surprise, if the assumption is made that ability is
normally distributed. In such a case, the tailored tests should have
performed similarly to a random sampling process from the item pools.
However, for the 3PL tailored tests, only the most discriminating items
were administered, regardless of content areas, since items were selected
for administration on the basis of the information function. Item
discrimination values do not come into play for the 1PL procedure since
they are all assumed to be one. Perhaps if a larger sample than 29 tailored
tests had been analyzed, the 1PL procedure would have achieved adequate
content validity.

In contrast, 3PL tailored testing procedures will undoubtedly require
branching schemes from one content area to another in order to ensure
adequate weighting of all the content areas. In this regard, content
validity might be more appropriately measured in terms of amount of
information or precision of measurement in each content area rather than
just number of items.

SUMMARY AND CONCLUSION

The results of applying tailored testing procedures to the measurement
of unidimensional vocabulary ability were generally satisfactory.
Reliabilities and information were comparable to or better than the
conventional test for both the 1PL and 3PL tests. However, tailored testing
applied to multidimensional achievement measurement presented many
difficulties. Both the 1PL and 3PL procedures were inadequate with regard
to reliability, test information, and content validity. Possible causes
were the small sample sizes used to calibrate the tests, resulting in
unstable item parameter estimates; a compounding of the instability of
the parameter estimates during linking procedures; the possibility that
latent trait models may not be robust with respect to violation of the
unidimensionality assumption by multi-content achievement tests; and the
nonconvergence of the 3PL tailored tests when using maximum likelihood
ability estimation.

One way to look at the present study is to view it as an example of
mistakes not to make in tailored achievement testing. From perhaps a more
reasonable perspective, the study illustrates that very little can be taken
for granted in setting up tailored testing procedures. Rather, one must
carefully make decisions about the operational procedures, while
considering the effects that such decisions might have. A great deal more
research must be conducted to determine optimal levels of the various
components that control tailored testing procedures. A study by Patience
and Reckase (1979) is an important step in this direction.
Table 1

Descriptive Statistics of Item Parameter
Estimates for Tailored Testing Item Pools

                 One-Parameter     Three-Parameter Calibration
                 Calibration       a_i        b_i         c_i

Mean                 .518          .758      -1.764       .238
S.D.                1.505          .720       3.800       .115
Low Value          -3.165          .010      -9.999*      .000
High Value          5.437         3.537      21.518       .500
No. of Items         180           180         180         180

*This value was an artificial lower limit on
the 3PL difficulty parameters.
Table 2

Goodness of Fit Comparison
Using the MSD Statistic

Observations     One-Parameter MSD     Three-Parameter MSD

     1               .2136                 [illegible]
     2               [illegible]           .2745
     3               [illegible]           .1507
     4               [illegible]           .1806
     5               [illegible]           .1471
     6               [illegible]           .1216
     7               [illegible]           .0979
     8               [illegible]           .2207
     9               [illegible]           .2047
    10               [illegible]           .2311
    11               [illegible]           .1642
    12               [illegible]           .2086
    13               [illegible]           .1897
    14               [illegible]           .2132
    15               [illegible]           [illegible]
    16               [illegible]           [illegible]
    17               .2216                 .0966
    18               .1797                 .1166
    19               [illegible]           .1723
    20               .2198                 .2554
    21               .1560                 .0962
    22               .2133                 .1210
    23               .2040                 .1012
    24               .2182                 .2841
    25               .2034                 .0762
    26               .2434                 .2061
    27               .1962                 .0672
    28               .2175                 .1620
    29               .2168                 .2649

Mean                 .2046                 .1649
S.D.                 .0426                 .0701

t(28) = 3.727 (p < .01)

Table 3

Ability Estimate Correlations

Variables        1         2              3                   4

1. 1PL 1        1.00      .44(.46)a      .05(.31)            .12(.24)
2. 1PL 2                  1.00           [illegible](.33)    .19(.13)
3. 3PL 1                                 1.00                .00(.12)
4. 3PL 2                                                     1.00

a(n = 110 cases)

(reliabilities when n = 101, due to deletion of 9 nonconvergence
cases)



Table 4

Correlations of Ability Estimates
With Traditional Course Exams(a)

Variables        1PL 1     1PL 2     3PL 1     3PL 2

Exam 1            .30       .41       .42       .09
Exam 2            .35       .28       .17       .20
Exam 3            .31       .22       .27       .20
Total Score       .57       .48       .41       .23

(a)(n = 101, since 9 nonconvergence cases were deleted
from the analysis)
Table 5

Tailored Test Descriptive Statistics(a)

                                     One-Parameter              Three-Parameter
                                     Tailored Test              Tailored Test
Variable                             Session 1    Session 2     Session 1    Session 2

Mean # of items administered         19.56        19.72         19.18        18.10
Mean # of items correct              12.59        12.42         13.64        12.98
Mean proportion of items correct       .64          .63           .71          .72
Mean of ability estimates             1.74         1.75           .06          .18
S.D. of ability estimates            .87(.86)(b)  .80(.77)      .61(.27)     .79([illegible])

(a)(n = 101, due to deletion of 9 nonconvergence cases)

(b)(n = 91, due to deletion of 10 cases with unreliable 3PL
ability estimates)



Table 6

Test Items by Content Area for Course Exam,
Item Pools, and Tailored Tests

                        Course Exam     Items in        Items in        Items in 29     Items in 29
                        Items (50)      1PL Pool (180)  3PL Pool (180)  1PL Tailored    3PL Tailored
                                                                        Tests (541)     Tests (548)
Content Area            Number     %    Number     %    Number     %    Number     %    Number     %

Anecdotal Records            5  10.0        17   9.4        17   9.4        49   9.1        57  10.4
Behavior Objectives          5  10.0        18  10.0        18  10.0        56  10.3        28   5.1
Checklists                   5  10.0        17   9.4        17   9.4        59  10.9        51   9.3
Peer Appraisals              2   4.0         7   3.9         7   3.9        13   2.4         0   0.0
Planning Tests               3   6.0        13   7.2        13   7.2        48   8.9        47   8.6
Rankings                     3   6.0        11   6.1        11   6.1        26   4.8        10   1.8
Ratings                      6  12.0        23  12.8        23  12.8        75  13.9       111  20.3
Selection Items              8  16.0        26  14.5        26  14.5        76  14.0       111  20.3
Self Report                  2   4.0         7   3.9         7   3.9        32   5.9        45   8.2
Supply Items                 5  10.0        19  10.6        19  10.6        62  11.5        26   4.7
Table of Specs.              6  12.0        22  12.2        22  12.2        45   8.3        62  11.3

Note: Listed below are the Chi Square values for several comparisons. The critical value for
rejection of adequate fit is χ²(10) > 18.31 at α = .05.

1. Course exam items vs. items in 1PL pool, χ² = .9978

2. Course exam items vs. items administered by 1PL tailored tests, χ² = 28.245

3. Items in 1PL pool vs. items administered by 1PL tailored tests, χ² = 21.383

4. Course exam items vs. items in 3PL pool, χ² = .9978

5. Course exam items vs. items administered by 3PL tailored tests, χ² = 134.341

6. Items in 3PL pool vs. items administered by 3PL tailored tests, χ² = 133.448
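
The kind of comparison summarized in the note can be sketched with an
ordinary goodness-of-fit Chi Square, in which the expected counts for the
tailored tests are obtained from the content area proportions of the item
pool. This is an illustration under that assumption; the exact computation
behind the tabled values is not stated, so the result should only be
expected to come close to comparison 3 rather than reproduce it digit for
digit. The counts are those read from Table 6.

```python
def chi_square_fit(observed, reference):
    """Goodness-of-fit Chi Square of observed counts against expected
    counts derived from the proportions in a reference distribution."""
    n_observed = sum(observed)
    n_reference = sum(reference)
    statistic = 0.0
    for obs, ref in zip(observed, reference):
        expected = n_observed * ref / n_reference
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Items per content area, in the row order of Table 6
pool_counts = [17, 18, 17, 7, 13, 11, 23, 26, 7, 19, 22]
tailored_1pl_counts = [49, 56, 59, 13, 48, 26, 75, 76, 32, 62, 45]

statistic = chi_square_fit(tailored_1pl_counts, pool_counts)
lack_of_fit = statistic > 18.31  # critical value for 10 df at alpha = .05
print(statistic, lack_of_fit)
```

The same function can be applied to any of the other pairings listed in
the note by substituting the corresponding columns of counts.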